A flexible statistical learning method will perform better.
Reason: the training sample is large, so a flexible model has the potential to fit a wide range of possible shapes of f, whereas an inflexible model with few parameters cannot.
A flexible statistical learning method will perform worse.
Reason: with p extremely large and n small, the curse of dimensionality sets in, and a flexible model will tend to overfit.
A flexible statistical learning method will perform better.
Reason: for a highly non-linear relationship, a flexible model with more degrees of freedom can capture the shape of f more accurately.
A flexible statistical learning method will perform worse.
Reason: in general, more flexible methods have higher variance, so when the variance of the error terms is extremely high they end up fitting the noise rather than f.
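To see the trade-off concretely, here is a minimal simulation sketch (the true f, sample size, and noise level are my own choices, not part of the exercise): an inflexible and a very flexible smoothing spline are fit to the same noisy sample and compared on held-out data.

set.seed(1)
f=function(x) sin(2*pi*x)                    # a non-linear "true" f, chosen for illustration
x=runif(100);  y=f(x)+rnorm(100, sd=0.3)     # training data
xt=runif(100); yt=f(xt)+rnorm(100, sd=0.3)   # test data
rigid=smooth.spline(x, y, df=2)              # inflexible: few effective parameters
wiggly=smooth.spline(x, y, df=30)            # flexible: many effective parameters
test_mse=function(fit) mean((yt-predict(fit, xt)$y)^2)
c(rigid=test_mse(rigid), flexible=test_mse(wiggly))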
Regression problem. Inference.
n=500 (the top US firms); p=3 (profit, number of employees, industry).
Classification problem. Prediction.
n=20 (similar previously launched products); p=13 (price charged, marketing budget, competition price, and ten other variables).
Regression problem. Prediction.
n=52 (weeks of 2012); p=3 (% change in US market, % change in British market, % change in German market).
(Plot omitted.)
Advantages: a flexible method can fit non-linear relationships better and decreases bias.
Disadvantages: it requires the estimation of many parameters, tends to increase variance, and can overfit.
When we are interested in prediction, we prefer more flexible methods.
When we are interested in inference, we prefer less flexible, more interpretable methods.
For parametric approaches, we make explicit assumptions about the functional form of f, which reduces the problem of estimating f to estimating a set of parameters. A non-parametric approach makes no assumption about the form of f, but requires a large number of observations to estimate f accurately.
Advantages: it is easier to estimate a set of parameters than to fit an arbitrary function.
Disadvantages: we do not know the true form of f, so the parametric form we choose may be far from the truth, giving a poor fit.
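A minimal sketch of the contrast on toy data (the data-generating process is my own choice): lm assumes a linear form and estimates two coefficients, while loess estimates f locally without a global assumption.

set.seed(1)
x=runif(200, 0, 10)
y=sin(x)+rnorm(200, sd=0.2)
lin=lm(y~x)                      # parametric: estimating f reduces to two coefficients
coef(lin)
smooth=loess(y~x, span=0.3)      # non-parametric: no global form assumed; needs many observations
plot(x, y)
lines(sort(x), predict(smooth)[order(x)], col='red')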
The Euclidean distance between each observation and the test point (0, 0, 0):

| Obs | Distance |
|-----|----------|
| 1 | 3 |
| 2 | 2 |
| 3 | \(\sqrt{10}\) |
| 4 | \(\sqrt{5}\) |
| 5 | \(\sqrt{2}\) |
| 6 | \(\sqrt{3}\) |
When K=1, the single nearest neighbor is observation 5, so the prediction is Green.
When K=3, the three nearest neighbors are observations 5, 6, and 2 (Green, Red, Red), so the majority-vote prediction is Red.
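Both predictions can be checked directly in R; the coordinates and labels below are transcribed from the exercise's table:

X=matrix(c(0,3,0, 2,0,0, 0,1,3, 0,1,2, -1,0,1, 1,1,1), ncol=3, byrow=TRUE)
Y=c('Red','Red','Red','Green','Green','Red')
d=sqrt(rowSums(X^2))                        # Euclidean distances to the test point (0,0,0)
d
Y[order(d)[1]]                              # K=1: label of the nearest neighbor (Green)
names(which.max(table(Y[order(d)[1:3]])))   # K=3: majority vote among the three nearest (Red)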
When the Bayes decision boundary is highly non-linear, a small value of K would be better.
The reason is that when K is small, the KNN boundary is very flexible; when K is large, the boundary becomes smoother and closer to linear.
college=read.csv("D:/ISLR_hw/College.csv")
head(college)
## X Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University Yes 1660 1232 721 23
## 2 Adelphi University Yes 2186 1924 512 16
## 3 Adrian College Yes 1428 1097 336 22
## 4 Agnes Scott College Yes 417 349 137 60
## 5 Alaska Pacific University Yes 193 146 55 16
## 6 Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1 52 2885 537 7440 3300 450 2200 70
## 2 29 2683 1227 12280 6450 750 1500 29
## 3 50 1036 99 11250 3750 400 1165 53
## 4 89 510 63 12960 5450 450 875 92
## 5 44 249 869 7560 4120 800 1500 76
## 6 62 678 41 13500 3335 500 675 67
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1 78 18.1 12 7041 60
## 2 30 12.2 16 10527 56
## 3 66 12.9 30 8735 54
## 4 97 7.7 37 19016 59
## 5 72 11.9 2 10922 15
## 6 73 9.4 11 9727 55
rownames(college)=college[,1]
fix(college)
Now remove the redundant first column:
college=college[,-1]
fix(college)
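The same result can be obtained in one step, since read.csv can take a column index as row names:

college=read.csv("D:/ISLR_hw/College.csv", row.names=1)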
summary(college)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
pairs(college[,1:10])
boxplot(Outstate~Private, college)
###### iv
Elite=rep('No',nrow(college))
Elite[college$Top10perc>50]='Yes'
Elite=as.factor(Elite)
college=data.frame(college,Elite)
summary(college$Elite)
## No Yes
## 699 78
boxplot(Outstate~Elite,college)
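An equivalent one-liner for creating Elite with base R's ifelse (just an alternative to the three steps above):

college$Elite=factor(ifelse(college$Top10perc>50, 'Yes', 'No'))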
par(mfrow=c(2,2))
hist(college$Apps)
hist(college$Accept)
hist(college$Enroll)
hist(college$PhD)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
phd_private=mean(filter(college,Private=='Yes')$PhD)
phd_public=mean(filter(college,Private=='No')$PhD)
On average, public universities have a higher percentage of faculty with PhDs (the PhD column), suggesting a stronger research focus.
college_1=mutate(college,Accept_rate=Accept/Apps)
Accept_elite=mean(filter(college_1,Elite=='Yes')$Accept_rate)
Accept_nonelite=mean(filter(college_1,Elite=='No')$Accept_rate)
On average, elite universities have a lower acceptance rate than non-elite ones.
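Both group means can also be computed in one pass each with dplyr's group_by and summarise; a sketch using the same variables as above:

college_1 %>% group_by(Private) %>% summarise(mean_phd=mean(PhD))
college_1 %>% group_by(Elite) %>% summarise(mean_accept_rate=mean(Accept_rate))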
Remove missing values from the data:
auto=read.csv("D:/ISLR_hw/Auto.csv", na.strings="?")
auto=na.omit(auto)
head(auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
We can see that mpg, cylinders, displacement, horsepower, weight, acceleration, and year are quantitative, while origin and name are qualitative (origin is coded numerically but represents a category).
apply(auto,2,range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] " 9.0" "3" " 68.0" " 46" "1613" " 8.0" "70"
## [2,] "46.6" "8" "455.0" "230" "5140" "24.8" "82"
## origin name
## [1,] "1" "amc ambassador brougham"
## [2,] "3" "vw rabbit custom"
# apply() coerced the data frame to a character matrix above because `name` is a
# character column; sapply over the numeric columns keeps the ranges numeric:
#sapply(auto[,1:7],range)
sapply(auto[, 1:7], mean)
## mpg cylinders displacement horsepower weight
## 23.445918 5.471939 194.411990 104.469388 2977.584184
## acceleration year
## 15.541327 75.979592
sapply(auto[, 1:7], sd)
## mpg cylinders displacement horsepower weight
## 7.805007 1.705783 104.644004 38.491160 849.402560
## acceleration year
## 2.758864 3.683737
auto_new=auto[-(10:85),]
apply(auto_new,2,range)
## mpg cylinders displacement horsepower weight acceleration year
## [1,] "11.0" "3" " 68" " 46" "1649" " 8.5" "70"
## [2,] "46.6" "8" "455" "230" "4997" "24.8" "82"
## origin name
## [1,] "1" "amc ambassador brougham"
## [2,] "3" "vw rabbit custom"
sapply(auto_new[, 1:7], mean)
## mpg cylinders displacement horsepower weight
## 24.404430 5.373418 187.240506 100.721519 2935.971519
## acceleration year
## 15.726899 77.145570
sapply(auto_new[, 1:7], sd)
## mpg cylinders displacement horsepower weight
## 7.867283 1.654179 99.678367 35.708853 811.300208
## acceleration year
## 2.693721 3.106217
library(ggplot2)
#library(GGally)
pairs(auto)
ggplot(auto,aes(weight,displacement))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Here we can see that displacement and weight are positively correlated.
From the plots in (e), we can see that almost all predictors are correlated with mpg, except name.
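This impression can be checked numerically by correlating each quantitative variable with mpg:

sort(cor(auto[,1:7])[,'mpg'])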
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
?Boston
There are 506 rows and 14 columns.
Each row represents a suburb of Boston; each column records one attribute of the suburbs (crime rate, tax rate, and so on).
pairs(Boston)
ggplot(Boston,aes(rad,crim))+geom_point()
It can be seen that suburbs with a higher index of accessibility to radial highways (rad) tend to have higher crime rates.
As found in (b), crim is clearly associated with rad.
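The associations can be ranked from the correlation matrix; a quick sketch:

sort(cor(Boston)[,'crim'], decreasing=TRUE)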
hist(Boston[Boston$crim>1,]$crim, breaks=25)
hist(Boston$tax, breaks=25)
hist(Boston$ptratio, breaks=25)
For the crime rate, most towns have low rates, but there is a long right tail: some suburbs have crime rates over 20, reaching above 80.
For the property tax rate, there is a large gap between low-tax and high-tax towns, with a second peak around 666.
For the pupil-teacher ratio, the distribution is skewed, with a large peak just above 20.
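A quick count of the tails described above (the thresholds are my own choices for illustration):

sum(Boston$crim>20)     # suburbs in the high-crime tail
sum(Boston$tax>=600)    # suburbs in the high-tax cluster
range(Boston$ptratio)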
nrow(Boston[Boston$chas==1,])
## [1] 35
There are 35 suburbs that bound the Charles River.
median(Boston$ptratio)
## [1] 19.05
The median pupil-teacher ratio is 19.05.
subset(Boston,medv==min(Boston$medv))
## crim zn indus chas nox rm age dis rad tax ptratio black
## 399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.90
## 406 67.9208 0 18.1 0 0.693 5.683 100 1.4254 24 666 20.2 384.97
## lstat medv
## 399 30.59 5
## 406 22.98 5
Suburbs 399 and 406 have the lowest median value of owner-occupied homes (medv = 5). Comparing their values with the quartiles from summary(Boston) below, both sit at the unfavorable end of nearly every predictor: crime rates far above the third quartile, the maximum age (100) and rad (24), and tax (666) and ptratio (20.2) at the third quartile.
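One way to make that comparison systematic: for suburb 399, compute the fraction of suburbs at or below its value on each variable (an empirical percentile; this one-liner is my own sketch):

sapply(names(Boston), function(v) mean(Boston[[v]] <= Boston[399, v]))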
dim(subset(Boston,rm>7))[1]
## [1] 64
dim(subset(Boston,rm>8))[1]
## [1] 13
There are 64 suburbs that average more than 7 rooms per dwelling, and 13 that average more than 8.
summary(subset(Boston,rm>8))
## crim zn indus chas
## Min. :0.02009 Min. : 0.00 Min. : 2.680 Min. :0.0000
## 1st Qu.:0.33147 1st Qu.: 0.00 1st Qu.: 3.970 1st Qu.:0.0000
## Median :0.52014 Median : 0.00 Median : 6.200 Median :0.0000
## Mean :0.71879 Mean :13.62 Mean : 7.078 Mean :0.1538
## 3rd Qu.:0.57834 3rd Qu.:20.00 3rd Qu.: 6.200 3rd Qu.:0.0000
## Max. :3.47428 Max. :95.00 Max. :19.580 Max. :1.0000
## nox rm age dis
## Min. :0.4161 Min. :8.034 Min. : 8.40 Min. :1.801
## 1st Qu.:0.5040 1st Qu.:8.247 1st Qu.:70.40 1st Qu.:2.288
## Median :0.5070 Median :8.297 Median :78.30 Median :2.894
## Mean :0.5392 Mean :8.349 Mean :71.54 Mean :3.430
## 3rd Qu.:0.6050 3rd Qu.:8.398 3rd Qu.:86.50 3rd Qu.:3.652
## Max. :0.7180 Max. :8.780 Max. :93.90 Max. :8.907
## rad tax ptratio black
## Min. : 2.000 Min. :224.0 Min. :13.00 Min. :354.6
## 1st Qu.: 5.000 1st Qu.:264.0 1st Qu.:14.70 1st Qu.:384.5
## Median : 7.000 Median :307.0 Median :17.40 Median :386.9
## Mean : 7.462 Mean :325.1 Mean :16.36 Mean :385.2
## 3rd Qu.: 8.000 3rd Qu.:307.0 3rd Qu.:17.40 3rd Qu.:389.7
## Max. :24.000 Max. :666.0 Max. :20.20 Max. :396.9
## lstat medv
## Min. :2.47 Min. :21.9
## 1st Qu.:3.32 1st Qu.:41.7
## Median :4.14 Median :48.3
## Mean :4.31 Mean :44.2
## 3rd Qu.:5.12 3rd Qu.:50.0
## Max. :7.44 Max. :50.0
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Compared with the full data set (summaries above), suburbs averaging more than eight rooms per dwelling have much lower crime rates (mean crim 0.72 vs. 3.61), lower lstat, and much higher medv.